arXiv:1604.01729v2  [cs.CL]  29  Nov  2016 


Improving  LSTM-based  Video  Description 
with  Linguistic  Knowledge  Mined  from  Text 


Subhashini Venugopalan, UT Austin, vsub@cs.utexas.edu
Lisa Anne Hendricks, UC Berkeley, lisa_anne@berkeley.edu
Raymond Mooney, UT Austin, mooney@cs.utexas.edu
Kate Saenko, Boston University, saenko@bu.edu


Abstract 

This paper investigates how linguistic knowledge mined from large text
corpora can aid the generation of natural language descriptions of videos.
Specifically, we integrate both a neural language model and distributional
semantics trained on large text corpora into a recent LSTM-based
architecture for video description. We evaluate our approach on a
collection of Youtube videos as well as two large movie description
datasets, showing significant improvements in grammaticality while modestly
improving descriptive quality.


1  Introduction 


The ability to automatically describe videos in natural language (NL)
enables many important applications including content-based video retrieval
and video description for the visually impaired. The most effective recent
methods (Venugopalan et al., 2015a; Yao et al., 2015) use recurrent neural
networks (RNN) and treat the problem as machine translation (MT) from video
to natural language. Deep learning methods such as RNNs need large training
corpora; however, there is a lack of high-quality paired video-sentence
data. In contrast, raw text corpora are widely available and exhibit rich
linguistic structure that can aid video description. Most work in
statistical MT utilizes both a language model trained on a large corpus of
monolingual target language data as well as a translation model trained on
more limited parallel bilingual data. This paper explores methods to
incorporate knowledge from language corpora to capture general linguistic
regularities to aid video description.


This paper integrates linguistic information into a video-captioning model
based on Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997)
RNNs which have shown state-of-the-art performance on the task. Further,
LSTMs are also effective as language models (LMs) (Sundermeyer et al.,
2010). Our first approach (early fusion) is to pre-train the network on
plain text before training on parallel video-text corpora. Our next two
approaches, inspired by recent MT work (Gulcehre et al., 2015), integrate
an LSTM LM with the existing video-to-text model. Furthermore, we also
explore replacing the standard one-hot word encoding with distributional
vectors trained on external corpora.

We present detailed comparisons between the approaches, evaluating them on
a standard Youtube corpus and two recent large movie description datasets.
The results demonstrate significant improvements in grammaticality of the
descriptions (as determined by crowdsourced human evaluations) and more
modest improvements in descriptive quality (as determined by both
crowdsourced human judgements and standard automated comparison to
human-generated descriptions). Our main contributions are 1) multiple ways
to incorporate knowledge from external text into an existing captioning
model, 2) extensive experiments comparing the methods on three large
video-caption datasets, and 3) human judgements to show that external
linguistic knowledge has a significant impact on grammar.


2  LSTM-based  Video  Description 

We use the successful S2VT video description framework from Venugopalan et
al. (2015a) as our underlying model and describe it briefly here. S2VT uses
a sequence to sequence approach (Sutskever et al., 2014; Cho et al., 2014)
that maps an input $x = (x_1, \ldots, x_T)$ video frame feature sequence to
a fixed dimensional vector and then decodes this into a sequence of output
words $y = (y_1, \ldots, y_N)$. As shown in Fig. 1, it employs a stack of
two LSTM layers. The input $x$ to the first LSTM layer is a sequence of
frame features obtained from the penultimate layer (fc7) of a Convolutional
Neural Network (CNN) after the ReLU operation. This LSTM layer encodes the
video sequence. At each time step, the hidden control state $h_t$ is
provided as input to a second LSTM layer. After viewing all the frames, the
second LSTM layer learns to decode this state into a sequence of words.
This can be viewed as using one LSTM layer to model the visual features,
and a second LSTM layer to model language conditioned on the visual
representation. We modify this architecture to incorporate linguistic
knowledge at different stages of the training and generation process.
Although our methods use S2VT, they are sufficiently general and could be
incorporated into other CNN-RNN based captioning models.

Figure 1: The S2VT architecture encodes a sequence of frames and decodes
them to a sentence. We propose to add knowledge from text corpora to
enhance the quality of video description.

3  Approach 


Existing visual captioning models (Vinyals et al., 2015; Donahue et al.,
2015) are trained solely on text from the caption datasets and tend to
exhibit some linguistic irregularities associated with a restricted
language model and a small vocabulary. Here, we investigate several
techniques to integrate prior linguistic knowledge into a CNN/LSTM-based
network for video to text (S2VT) and evaluate their effectiveness at
improving the overall description.


Early Fusion. Our first approach (early fusion) is to pre-train portions of
the network modeling language on large corpora of raw NL text and then
continue "fine-tuning" the parameters on the paired video-text corpus. An
LSTM model learns to estimate the probability of an output sequence given
an input sequence. To learn a language model, we train the LSTM layer to
predict the next word given the previous words. Following the S2VT
architecture, we embed one-hot encoded words in lower dimensional vectors.
The network is trained on web-scale text corpora and the parameters are
learned through backpropagation using stochastic gradient descent.^1 The
weights from this network are then used to initialize the embedding and
weights of the LSTM layers of S2VT, which is then trained on video-text
data. This trained LM is also used as the LSTM LM in the late and deep
fusion models.
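
As a hedged illustration of the next-word prediction objective described above, the sketch below computes the perplexity of a word sequence from the probabilities a language model assigns to each correct next word, using the standard definition (exponential of the mean negative log-likelihood). The function name `perplexity` and the toy probabilities are assumptions made for illustration and are not taken from the released code; the footnote reports that the trained LM reached a perplexity of 120.

```python
import math


def perplexity(next_word_probs):
    """Return exp(mean negative log-likelihood) of the correct next words."""
    if not next_word_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in next_word_probs) / len(next_word_probs)
    return math.exp(mean_nll)


# Hypothetical values: a model that always gives the correct next word
# probability 1/120 has perplexity 120.
toy_probs = [1.0 / 120] * 5
toy_perplexity = perplexity(toy_probs)
```

Lower perplexity means the model spreads less probability mass over incorrect continuations, which is the property the pre-trained LM is meant to transfer to the captioning network.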

Late Fusion. Our late fusion approach is similar to how neural machine
translation models incorporate a trained language model during decoding. At
each step of sentence generation, the video caption model proposes a
distribution over the vocabulary. We then use the language model to
re-score the final output by taking a weighted average of the scores
proposed by the LM and by the S2VT video-description model (VM). More
specifically, if $y_t$ denotes the output at time step $t$, and if $p_{VM}$
and $p_{LM}$ denote the proposal distributions of the video captioning
model and the language model respectively, then for all words $y' \in V$ in
the vocabulary we can recompute the score of each new word,
$p(y_t = y')$, as:

$$\alpha \cdot p_{VM}(y_t = y') + (1 - \alpha) \cdot p_{LM}(y_t = y') \quad (1)$$

Hyper-parameter $\alpha$ is tuned on the validation set.
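
The late fusion re-scoring of Eq. 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionaries standing in for the two proposal distributions, the three-word vocabulary, the value of alpha, and the function name `late_fusion` are all hypothetical and not part of the paper's code.

```python
def late_fusion(p_vm, p_lm, alpha):
    """Re-score every word as alpha * p_VM + (1 - alpha) * p_LM (Eq. 1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p_vm.keys() != p_lm.keys():
        raise ValueError("both distributions must cover the same vocabulary")
    return {word: alpha * p_vm[word] + (1.0 - alpha) * p_lm[word] for word in p_vm}


# Hypothetical proposal distributions at one decoding step.
p_vm = {"cat": 0.5, "dog": 0.3, "is": 0.2}   # video captioning model
p_lm = {"cat": 0.1, "dog": 0.1, "is": 0.8}   # language model
fused = late_fusion(p_vm, p_lm, alpha=0.75)
best_word = max(fused, key=fused.get)
```

Because both inputs are probability distributions and the weights sum to one, the fused scores again sum to one, so no renormalization is needed in this sketch.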

Deep Fusion. In the deep fusion approach (Fig. 2), we integrate the LM a
step deeper in the generation process by concatenating the hidden state of
the language model LSTM ($h_t^{LM}$) with the hidden state of the S2VT
video description model ($h_t^{VM}$) and use the combined latent vector to
predict the output word. This is similar to the technique proposed by
Gulcehre et al. (2015) for incorporating language models trained on
monolingual corpora for machine translation. However, our approach differs
in two key ways: (1) we only concatenate the hidden states of the S2VT LSTM
and language LSTM and do not use any additional context information, (2) we
fix the weights of the LSTM language model but train the full video
captioning network. In this case, the probability of the predicted word at
time step $t$ is:

$$p(y_t \mid y_{<t}, x) \propto \exp\left(W f(h_t^{VM}, h_t^{LM}) + b\right) \quad (2)$$

where $x$ is the visual feature input, $W$ is the weight matrix, and $b$
the biases. We avoid tuning the LSTM LM to prevent overwriting already
learned weights of a strong language model. But we train the full video
caption model to incorporate the LM outputs while training on the caption
domain.

^1 The LM was trained to achieve a perplexity of 120.

Figure 2: Illustration of our late and deep fusion approaches to integrate
an independently trained LM to aid video captioning. The deep fusion model
learns jointly from the hidden representations of the LM and S2VT
video-to-text model (Vid-LSTM), whereas the late fusion re-scores the
softmax output of the video-to-text model.

Distributional Word Representations. The S2VT network, like most image and
video captioning models, represents words using a 1-of-N (one hot)
encoding. During training, the model learns to embed "one-hot" words into a
lower 500d space by applying a linear transformation. However, the
embedding is learned only from the limited and possibly noisy text in the
caption data. There are many approaches (Mikolov et al., 2013; Pennington
et al., 2014) that use large text corpora to learn vector-space
representations of words that capture fine-grained semantic and syntactic
regularities. We propose to take advantage of these to aid video
description. Specifically, we replace the embedding matrix from one-hot
vectors and instead use 300-dimensional GloVe vectors (Pennington et al.,
2014) pre-trained on 6B tokens from Gigaword and Wikipedia 2014. In
addition to using the distributional vectors for the input, we also explore
variations where the model predicts both the one-hot word (trained on the
softmax loss), as well as predicting the distributional vector from the
LSTM hidden state using Euclidean loss as the objective. Here the output
vector ($y_t$) is computed as $y_t = (W_g h_t + b_g)$, and the loss is
given by:

$$\mathcal{L}(y_t, w_{glove}) = \left\| (W_g h_t + b_g) - w_{glove} \right\|^2 \quad (3)$$

where $h_t$ is the LSTM output, $w_{glove}$ is the word's GloVe embedding
and $W_g$, $b_g$ are weights and biases. The network then essentially
becomes a multi-task model with two loss functions. However, we use this
loss only to influence the weights learned by the network; the predicted
word embedding is not used.

Ensembling. The overall loss function of the video-caption network is
non-convex, and difficult to optimize. In practice, using an ensemble of
networks trained slightly differently can improve performance (Hansen and
Salamon, 1990). In our work we also present results of an ensemble by
averaging the predictions of the best performing models.
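
The deep fusion prediction step (Eq. 2 in this section) can be sketched in plain Python as below. The sketch assumes that the combining function f is simple concatenation of the two hidden states, as the text describes; the toy dimensions, weight values, and the names `softmax` and `deep_fusion_step` are hypothetical. It shows only the forward computation; in the paper the LM weights are kept fixed while the rest of the captioning network is trained.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deep_fusion_step(h_vm, h_lm, W, b):
    """Word distribution proportional to exp(W [h_vm; h_lm] + b)."""
    h = list(h_vm) + list(h_lm)  # assumed f: concatenate the two hidden states
    if len(W) != len(b) or any(len(row) != len(h) for row in W):
        raise ValueError("W must have one row per word and one column per hidden unit")
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


# Hypothetical 2-d hidden states and a 3-word vocabulary.
h_vm = [1.0, 0.0]           # hidden state of the video-to-text LSTM
h_lm = [0.0, 1.0]           # hidden state of the (frozen) language model LSTM
W = [[1.0, 0.0, 0.0, 1.0],  # word 0 is supported by both models
     [0.0, 1.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]
probs = deep_fusion_step(h_vm, h_lm, W, b)
```

In this toy setting the word whose weight row aligns with both hidden states receives the highest probability, which mirrors the intent of letting the output layer learn jointly from the visual and linguistic representations.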


4  Experiments 


Datasets. Our language model was trained on sentences from Gigaword, BNC,
UkWaC, and Wikipedia. The vocabulary consisted of the 72,700 most frequent
tokens also containing GloVe embeddings. Following Venugopalan et al.
(2015a), we compare our models on the Youtube dataset (Chen and Dolan,
2011), as well as two large movie description corpora: MPII-MD (Rohrbach et
al., 2015) and M-VAD (Torabi et al., 2015).


Evaluation Metrics. We evaluate performance using machine translation (MT)
metrics METEOR (Denkowski and Lavie, 2014) and BLEU (Papineni et al., 2002)
to compare the machine-generated descriptions to human ones. For the movie
corpora, which have just a single description, we use only METEOR, which is
more robust.


Human Evaluation. We also obtain human judgements using Amazon Mechanical
Turk on a random subset of 200 video clips for each dataset. Each sentence
was rated by 3 workers on a Likert scale of 1 to 5 (higher is better) for
relevance and grammar. No video was provided during grammar evaluation. For
movies, due to copyright, we only evaluate on grammar.


Model             METEOR   B-4    Relevance   Grammar
S2VT              29.2     37.0   2.06        3.76
Early Fusion      29.6     37.6   -           -
Late Fusion       29.4     37.2   -           -
Deep Fusion       29.6     39.3   -           -
Glove             30.0     37.0   -           -
Glove+Deep
 - Web Corpus     30.3     38.1   2.12        4.05*
 - In-Domain      30.3     38.8   2.21*       4.17*
Ensemble          31.4     42.1   2.24*       4.20*
Human             -        -      4.52        4.47

Table 1: Youtube dataset: METEOR and BLEU@4 in %, and human ratings (1-5)
on relevance and grammar. Best results in bold, * indicates significant
over S2VT.


                  MPII-MD              M-VAD
Model             METEOR   Grammar     METEOR   Grammar
S2VT†             6.5      2.6         6.6      2.2
Early Fusion      6.7      -           6.8      -
Late Fusion       6.5      -           6.7      -
Deep Fusion       6.8      -           6.8      -
Glove             6.7      3.9*        6.7      3.1*
Glove+Deep        6.8      4.1*        6.7      3.3*

Table 2: Movie Corpora: METEOR (%) and human grammar ratings (1-5, higher
is better). Best results in bold, * indicates significant over S2VT.

S2VT:  Someone  sits  in  the  bed. 

Glove:  Someone  sits  on  the  couch  and  watches  her  phone. 
Glove+Deep:  Someone  sits  on  the  couch,  watching  her, 
her  feet  on  her  lap. 

GT:  Someone  drops  the  flowers  and  kisses  someone. 


4.1  Youtube  Video  Dataset  Results 

Comparison of the proposed techniques in Table 1 shows that Deep Fusion
performs well on both METEOR and BLEU; incorporating Glove embeddings
substantially increases METEOR, and combining them both does best. Our
final model is an ensemble (weighted average) of the Glove model and the
two Glove+Deep Fusion models trained on the external and in-domain COCO
(Lin et al., 2014) sentences. We note here that the state-of-the-art on
this dataset is achieved by HRNE (Pan et al., 2015) (METEOR 33.1), which
proposes a superior visual processing pipeline using attention to encode
the video.
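
The ensemble described above is a weighted average of model predictions. The sketch below illustrates one way to do this at a single decoding step, assuming that what is averaged is each model's distribution over the next word. That choice, the weights, the two-word vocabulary, and the name `ensemble_average` are assumptions for illustration; the paper does not report the weights used.

```python
def ensemble_average(distributions, weights):
    """Weighted average of per-model word distributions, normalized by total weight."""
    if not distributions or len(distributions) != len(weights):
        raise ValueError("need one weight per model distribution")
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    vocab = distributions[0].keys()
    if any(d.keys() != vocab for d in distributions):
        raise ValueError("all models must share the same vocabulary")
    return {
        word: sum(wt * d[word] for wt, d in zip(weights, distributions)) / total
        for word in vocab
    }


# Hypothetical next-word distributions from the three ensembled models.
glove = {"man": 0.6, "woman": 0.4}
glove_deep_web = {"man": 0.2, "woman": 0.8}
glove_deep_in_domain = {"man": 0.4, "woman": 0.6}
combined = ensemble_average(
    [glove, glove_deep_web, glove_deep_in_domain], weights=[1.0, 1.0, 2.0]
)
```

Dividing by the total weight keeps the result a valid distribution even when the weights are not pre-normalized.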

Human ratings also correlate well with the METEOR scores, confirming that
our methods give a modest improvement in descriptive quality. However,
incorporating linguistic knowledge significantly^2 improves the
grammaticality of the results, making them more comprehensible to human
users.

Embedding Influence. We experimented with multiple ways to incorporate word
embeddings: (1) GloVe input: Replacing one-hot vectors with GloVe on the
LSTM input performed best. (2) Fine-tuning: Initializing with GloVe and
subsequently fine-tuning the embedding matrix reduced validation results by
0.4 METEOR. (3) Input and Predict: Training the LSTM to accept and predict
GloVe vectors, as described in Section 3, performed similar to (1). All
scores reported in Tables 1 and 2 correspond to the setting in (1) with
GloVe embeddings only as input.
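
For the "Input and Predict" variant, Section 3 describes predicting a distributional vector $y_t = W_g h_t + b_g$ from the LSTM output and training it with a Euclidean loss against the word's GloVe embedding. A minimal sketch of that loss is given below. It assumes the loss is the unscaled squared Euclidean distance; the toy dimensions and values and the name `glove_prediction_loss` are hypothetical.

```python
def glove_prediction_loss(h_t, W_g, b_g, w_glove):
    """Squared Euclidean distance between W_g h_t + b_g and the target GloVe vector."""
    if len(W_g) != len(b_g) or len(W_g) != len(w_glove):
        raise ValueError("output dimension must match the GloVe dimension")
    if any(len(row) != len(h_t) for row in W_g):
        raise ValueError("each row of W_g must match the hidden size")
    # Predicted distributional vector y_t = W_g h_t + b_g.
    y_t = [sum(w * h for w, h in zip(row, h_t)) + bias for row, bias in zip(W_g, b_g)]
    return sum((y - g) ** 2 for y, g in zip(y_t, w_glove))


# Hypothetical 2-d LSTM output and 3-d "GloVe" target (real vectors are 300-d).
h_t = [1.0, 2.0]
W_g = [[1.0, 0.0],
       [0.0, 1.0],
       [1.0, 1.0]]
b_g = [0.0, 0.0, 0.5]
w_glove = [1.0, 2.0, 3.0]
loss = glove_prediction_loss(h_t, W_g, b_g, w_glove)
```

As the paper notes, this auxiliary loss only shapes the learned weights during training; the predicted embedding itself is not used when generating words.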


^2 Using the Wilcoxon Signed-Rank test, results were significant with
p < 0.02 on relevance and p < 0.001 on grammar.


Figure 3: Two frames from a clip. Models generate visually relevant
sentences but differ from groundtruth (GT).

4.2  Movie  Description  Results 

Results on the movie corpora are presented in Table 2. Both MPII-MD and
M-VAD have only a single ground truth description for each video, which
makes both learning and evaluation very challenging (e.g., Fig. 3). METEOR
scores are fairly low on both datasets since generated sentences are
compared to a single reference translation. S2VT† is a re-implementation of
the base S2VT model with the new vocabulary and architecture (embedding
dimension). We observe that the ability of external linguistic knowledge to
improve METEOR scores on these challenging datasets is small but
consistent. Again, human evaluations show significant (with p < 0.0001)
improvement in grammatical quality.


5  Related  Work 


Following the success of LSTM-based models on Machine Translation
(Sutskever et al., 2014; Bahdanau et al., 2015), and image captioning
(Vinyals et al., 2015; Donahue et al., 2015), recent video description
works (Venugopalan et al., 2015b; Venugopalan et al., 2015a; Yao et al.,
2015) propose CNN-RNN based models that generate a vector representation
for the video and "decode" it using an LSTM sequence model to generate a
description. Venugopalan et al. (2015b) also incorporate external data such
as images with captions to improve video description; however, in this
work, our focus is on integrating external linguistic knowledge for video
captioning. We specifically investigate the use of distributional semantic
embeddings and LSTM-based language models trained on external text corpora
to aid existing CNN-RNN based video description models.

LSTMs have proven to be very effective language models (Sundermeyer et al.,
2010). Gulcehre et al. (2015) developed an LSTM model for machine
translation that incorporates a monolingual language model for the target
language, showing improved results. We utilize similar approaches (late
fusion, deep fusion) to train an LSTM for translating video to text that
exploits large monolingual-English corpora (Wikipedia, BNC, UkWac) to
improve RNN based video description networks. However, unlike Gulcehre et
al. (2015), where the monolingual LM is used only to tune specific
parameters of the translation network, the key advantage of our approach is
that the output of the monolingual language model is used (as an input)
when training the full underlying video description network.


Contemporaneous to us, Yu et al. (2015), Pan et al. (2015) and Ballas et
al. (2016) propose video description models focusing primarily on improving
the video representation itself using a hierarchical visual pipeline, and
attention. Without the attention mechanism their models achieve METEOR
scores of 31.1, 32.1 and 31.6 respectively on the Youtube dataset. The
interesting aspect, as demonstrated in our experiments (Table 1), is that
the contribution of language alone is considerable and only slightly less
than the visual contribution on this dataset. Hence, it is important to
focus on both aspects to generate better descriptions.

6  Conclusion 


This paper investigates multiple techniques to incorporate linguistic
knowledge from text corpora to aid video captioning. We empirically
evaluate our approaches on Youtube clips as well as two movie description
corpora. Our results show significant improvements on human evaluations of
grammar while modestly improving the overall descriptive quality of
sentences on all datasets. While the proposed techniques are evaluated on a
specific video-caption network, they are generic and can be applied to
other video and image captioning models (Hendricks et al., 2016;
Venugopalan et al., 2016). The code and models are shared on
http://vsubhashini.github.io/language_fusion.html.


S2VT: The sunsets down on the homestead.
Glove: The unk mountains of the river, which is filled with a large sea.
Glove+Deep: The hogwarts express chugs through the barren moorland.
GT: Steam billows from the funnel as the hogwarts express travels through
the rain beside the edge of a vast lake.

S2VT: Someone pulls up the car.
Glove: Someone is in the car, looking out of the window
Glove+Deep: The car is coming down the street, and someone is waiting for
the car.
GT: He slows down in front of one house with a triple garage and box tree
on the front lawn and pulls up onto the driveway.

S2VT: Someone is standing in the hall.
Glove: Someone looks at someone, then turns to someone.
Glove+Deep: Someone looks at someone, who is still standing in the doorway,
watching the tv.
GT: Someone thrusts a wet umbrella at someone.

S2VT: Someone is in the kitchen.
Glove: Someone walks into the kitchen and sits down.
Glove+Deep: Someone walks over to the window and looks out.
GT: Someone is still eating and watching television.

S2VT: Someone is standing in front of a large, closed-down gas station by
the side of the road.
Glove: Someone is sitting on the ground, his head bowed.
Glove+Deep: Someone is walking along the sidewalk, a tall camel, a man in a
ferret, a bloodhound drooling.
GT: A magnificent creature stands in front of them.

S2VT: Someone takes a head. The man on a door.
Glove: Someone unk her gaze. Someone and someone dance.
Glove+Deep: Someone and someone watch the dance floor. Someone and someone
dance.
GT: He leads her to the dance floor and flings off his jacket. He raises
her arms above her head.

S2VT: Someone and someone pull up to the car. Someone looks up at the
departing security window.
Glove: Someone pulls out a car. Someone glances at the wheel, then turns to
the side of the road.
Glove+Deep: Someone pulls out a pair of doors and slides out of the car. He
pulls out a pistol.
GT: Drawing his gun, someone returns fire. Someone cowers. The pick-up
swerves onto the one-way street and jams itself alongside the delta,
mangling the convertibles headlight and someone. The vehicles separate.
Someone bashes the pick-up.

S2VT: Someone, someone walks into the window.
Glove: Someone is in the back of the car.
Glove+Deep: Someone grabs the phone and punches it at someone.
GT: Someone grabs the tablecloth.

Figure 4: Representative frames from clips in the movie description
corpora. S2VT is the baseline model, Glove indicates the model trained with
input Glove vectors, and Glove+Deep uses input Glove vectors with the Deep
Fusion approach. GT indicates groundtruth sentence.


Acknowledgements 

This work was supported by NSF awards IIS-1427425 and IIS-1212798, and ONR
ATL Grant N00014-11-1-010, and DARPA under AFRL grant FA8750-13-2-0026.
Raymond Mooney and Kate Saenko also acknowledge support from a Google
grant. Lisa Anne Hendricks is supported by the National Defense Science and
Engineering Graduate (NDSEG) Fellowship.


References 

[Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
2015. Neural machine translation by jointly learning to align and
translate. ICLR.

[Ballas et al.2016] Nicolas Ballas, Li Yao, Chris Pal, and Aaron C.
Courville. 2016. Delving deeper into convolutional networks for learning
video representations. ICLR.

[Chen and Dolan2011] David Chen and William Dolan. 2011. Collecting highly
parallel data for paraphrase evaluation. In ACL.

[Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and
Yoshua Bengio. 2014. On the properties of neural machine translation:
Encoder-decoder approaches. Syntax, Semantics and Structure in Statistical
Translation, page 103.

[Denkowski and Lavie2014] Michael Denkowski and Alon Lavie. 2014. Meteor
universal: Language specific translation evaluation for any target
language. In EACL.

[Donahue et al.2015] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama,
Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell.
2015. Long-term recurrent convolutional networks for visual recognition and
description. In CVPR.

[Gulcehre et al.2015] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault,
H.C. Lin, F. Bougares, H. Schwenk, and Y. Bengio. 2015. On using
monolingual corpora in neural machine translation. arXiv preprint
arXiv:1503.03535.

[Hansen and Salamon1990] L. K. Hansen and P. Salamon. 1990. Neural network
ensembles. IEEE TPAMI, 12(10):993-1001, Oct.

[Hendricks et al.2016] Lisa Anne Hendricks, Subhashini Venugopalan, Marcus
Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Darrell. 2016. Deep
compositional captioning: Describing novel object categories without paired
training data. In CVPR.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jurgen Schmidhuber.
1997. Long short-term memory. Neural computation, 9(8).

[Lin et al.2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014.
Microsoft coco: Common objects in context. In ECCV.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. 2013. Efficient estimation of word representations in vector space.
NIPS.

[Pan et al.2015] Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting
Zhuang. 2015. Hierarchical recurrent neural encoder for video
representation with application to captioning. CVPR.

[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine
translation. In ACL.

[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher
D Manning. 2014. Glove: Global vectors for word representation. Proceedings
of the Empirical Methods in Natural Language Processing (EMNLP 2014),
12:1532-1543.

[Rohrbach et al.2015] Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and
Bernt Schiele. 2015. A dataset for movie description. In CVPR.

[Sundermeyer et al.2010] M. Sundermeyer, R. Schluter, and H. Ney. 2010.
Lstm neural networks for language modeling. In INTERSPEECH.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.
Sequence to sequence learning with neural networks. In NIPS.

[Torabi et al.2015] Atousa Torabi, Christopher Pal, Hugo Larochelle, and
Aaron Courville. 2015. Using descriptive video services to create a large
data source for video annotation research. arXiv:1503.01070v1.

[Venugopalan et al.2015a] S. Venugopalan, M. Rohrbach, J. Donahue, R.
Mooney, T. Darrell, and K. Saenko. 2015a. Sequence to sequence - video to
text. ICCV.

[Venugopalan et al.2015b] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue,
Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015b. Translating videos
to natural language using deep recurrent neural networks. In NAACL.

[Venugopalan et al.2016] S. Venugopalan, L.A. Hendricks, M. Rohrbach, R.
Mooney, T. Darrell, and K. Saenko. 2016. Captioning images with diverse
objects. arXiv preprint arXiv:1606.07770.

[Vinyals et al.2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and
Dumitru Erhan. 2015. Show and tell: A neural image caption generator. CVPR.

[Yao et al.2015] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas,
Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing
videos by exploiting temporal structure. ICCV.

[Yu et al.2015] Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu.
2015. Video paragraph captioning using hierarchical recurrent neural
networks. CVPR.


